📊 Data Analytics Final Project: Superstore Sales Performance Analysis¶
Problem Statement; The goal of this project is to analyze sales performance and profitability using the Superstore dataset. The analysis aims to identify key patterns and trends in sales, profit, and customer behavior across various regions, categories, and shipping modes. The objective is to uncover factors influencing profit margins, detect underperforming product categories, and provide data-driven recommendations to improve business performance and operational efficiency.
Domain Relevance; Sales / Retail / E-commerce The dataset belongs to the retail sales domain, representing transactions from an office supply superstore that sells furniture, office supplies, and technology products. This is a typical sales analytics use-case where companies analyze orders, profits, discounts, and shipping patterns to make strategic decisions.
Dataset Quality;
Rows: 9,994
Columns: 21
Data Types: Mix of categorical (Region, Category, City), numeric (Sales, Profit, Quantity, Discount), and datetime (Order Date, Ship Date).
Time Period: Covers multiple years of orders, allowing time-series analysis.
Features: Contain information on customer, product, order, shipment, and financial metrics — suitable for deep EDA
Variety: Mix of numerical, categorical, and date fields
Dataset Source; The dataset used is the “Superstore Dataset Fainal” dataset, publicly available on Kaggle [https://www.kaggle.com/datasets/vivek468/superstore-dataset-final] It is widely used in data analytics projects to demonstrate sales, profit, and shipping analyses.
Summary; This project will perform an end-to-end exploratory data analysis (EDA) on the Superstore dataset to understand sales trends, customer behavior, and profitability drivers. The dataset is high-quality, rich in both categorical and numerical attributes, and highly relevant to the sales analytics domain.
🧹 Data Cleaning & Pre-processing¶
import pandas as pd
file_path = r"C:\Users\MUBILFARIS\Downloads\Sample - Superstore.csv"
df = pd.read_csv(file_path, encoding='latin1')
df.head()
| Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9600 | 2 | 0.00 | 41.9136 |
| 1 | 2 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.9400 | 3 | 0.00 | 219.5820 |
| 2 | 3 | CA-2016-138688 | 6/12/2016 | 6/16/2016 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | ... | 90036 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.6200 | 2 | 0.00 | 6.8714 |
| 3 | 4 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.0310 |
| 4 | 5 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.3680 | 2 | 0.20 | 2.5164 |
5 rows × 21 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9994 entries, 0 to 9993 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Row ID 9994 non-null int64 1 Order ID 9994 non-null object 2 Order Date 9994 non-null object 3 Ship Date 9994 non-null object 4 Ship Mode 9994 non-null object 5 Customer ID 9994 non-null object 6 Customer Name 9994 non-null object 7 Segment 9994 non-null object 8 Country 9994 non-null object 9 City 9994 non-null object 10 State 9994 non-null object 11 Postal Code 9994 non-null int64 12 Region 9994 non-null object 13 Product ID 9994 non-null object 14 Category 9994 non-null object 15 Sub-Category 9994 non-null object 16 Product Name 9994 non-null object 17 Sales 9994 non-null float64 18 Quantity 9994 non-null int64 19 Discount 9994 non-null float64 20 Profit 9994 non-null float64 dtypes: float64(3), int64(3), object(15) memory usage: 1.6+ MB
df.isnull().sum()
Row ID 0 Order ID 0 Order Date 0 Ship Date 0 Ship Mode 0 Customer ID 0 Customer Name 0 Segment 0 Country 0 City 0 State 0 Postal Code 0 Region 0 Product ID 0 Category 0 Sub-Category 0 Product Name 0 Sales 0 Quantity 0 Discount 0 Profit 0 dtype: int64
file_path = r"C:\Users\MUBILFARIS\Downloads\Sample - Superstore.csv"
df = pd.read_csv(file_path, encoding='latin1')
sum(df.duplicated())
0
Note; The dataset has mostly complete records. Missing values are minimal or non-existent. Duplicate check ensures there are no repeated transactions. Any duplicates found will be removed.
df = df.drop_duplicates()
df['Postal Code'] = df['Postal Code'].astype(str)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 9994 entries, 0 to 9993 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Row ID 9994 non-null int64 1 Order ID 9994 non-null object 2 Order Date 9994 non-null object 3 Ship Date 9994 non-null object 4 Ship Mode 9994 non-null object 5 Customer ID 9994 non-null object 6 Customer Name 9994 non-null object 7 Segment 9994 non-null object 8 Country 9994 non-null object 9 City 9994 non-null object 10 State 9994 non-null object 11 Postal Code 9994 non-null object 12 Region 9994 non-null object 13 Product ID 9994 non-null object 14 Category 9994 non-null object 15 Sub-Category 9994 non-null object 16 Product Name 9994 non-null object 17 Sales 9994 non-null float64 18 Quantity 9994 non-null int64 19 Discount 9994 non-null float64 20 Profit 9994 non-null float64 dtypes: float64(3), int64(2), object(16) memory usage: 1.6+ MB
Note; Removed duplicate rows to maintain data integrity. Formatted the Postal Code column to string since it represents a location code, not a numeric value.
missing = df.isna().sum()
missing[missing > 0]
Series([], dtype: int64)
df = df.dropna(subset=['Sales', 'Profit'])
Note; The dataset had no significant missing data. If any missing or invalid values appear, they are handled appropriately by dropping or imputing.
df.dtypes
Row ID int64 Order ID object Order Date object Ship Date object Ship Mode object Customer ID object Customer Name object Segment object Country object City object State object Postal Code object Region object Product ID object Category object Sub-Category object Product Name object Sales float64 Quantity int64 Discount float64 Profit float64 dtype: object
# Convert date columns from string to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])
df['Order_Year'] = df['Order Date'].dt.year
df['Order_Month'] = df['Order Date'].dt.month
df['Order_MonthName'] = df['Order Date'].dt.strftime('%b')
df['Order_DayOfWeek'] = df['Order Date'].dt.day_name()
df['Shipping_Duration_Days'] = (df['Ship Date'] - df['Order Date']).dt.days
df['Sales_per_Unit'] = df['Sales'] / df['Quantity'].replace(0, pd.NA)
df['Profit_Margin'] = df['Profit'] / df['Sales'].replace(0, pd.NA)
df[['Order ID','Order Date','Ship Date','Shipping_Duration_Days','Profit_Margin']].head()
| Order ID | Order Date | Ship Date | Shipping_Duration_Days | Profit_Margin | |
|---|---|---|---|---|---|
| 0 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | 3 | 0.1600 |
| 1 | CA-2016-152156 | 2016-11-08 | 2016-11-11 | 3 | 0.3000 |
| 2 | CA-2016-138688 | 2016-06-12 | 2016-06-16 | 4 | 0.4700 |
| 3 | US-2015-108966 | 2015-10-11 | 2015-10-18 | 7 | -0.4000 |
| 4 | US-2015-108966 | 2015-10-11 | 2015-10-18 | 7 | 0.1125 |
Note; Derived new columns to support deeper insights:
Order_Year, Month, DayOfWeek — for trend analysis.
Shipping_Duration_Days — to track delivery time.
Sales_per_Unit — measures efficiency.
Profit_Margin — shows profitability percentage.
df.to_csv("superstore_cleaned.csv", index=False)
print("Cleaned dataset saved successfully!")
Cleaned dataset saved successfully!
Summary — Data Cleaning & Pre-processing
Checked dataset structure, types, and missing values.
Removed duplicate rows.
Corrected data types and formatted columns.
Created derived features for advanced analysis:
Order_Year, Order_MonthName, Shipping_Duration_Days, Sales_per_Unit, Profit_Margin
Saved the cleaned data for EDA.
🎯Exploratory Data Analysis (EDA) & Visualizations¶
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
sns.set(style="whitegrid")
📊 1. Univariate Analysis (single variable)¶
(a) Sales Distribution; Sales are right-skewed, with most transactions being small and a few large orders contributing most of the revenue.
plt.figure(figsize=(8,5))
sns.histplot(df['Sales'], bins=30, kde=True)
plt.title('Distribution of Sales')
plt.show()
(b) Profit Distribution; Profit is also right-skewed — most orders earn small profits, while some show losses due to high discounts.
plt.figure(figsize=(8,5))
sns.histplot(df['Profit'], bins=30, kde=True, color='green')
plt.title('Distribution of Profit')
plt.show()
(c) Sales by Category; Technology and Furniture generate the highest sales, while Office Supplies contributes smaller but steady sales.
plt.figure(figsize=(8,5))
sns.barplot(x='Category', y='Sales', data=df, estimator=sum)
plt.title('Total Sales by Category')
plt.show()
(d) 0rders by Ship mode; Standard Class is the most used shipping method, showing customers prefer cost-effective delivery options.
plt.figure(figsize=(7,4))
sns.countplot(x='Ship Mode', data=df)
plt.title('Order Count by Ship Mode')
plt.show()
📊 2. Bivariate Analysis (Two variables)¶
(a) Sales vs Profit; There’s a positive correlation between sales and profit overall, but some high-sales orders show low or negative profits, likely due to discounts.
plt.figure(figsize=(8,5))
sns.scatterplot(x='Sales', y='Profit', data=df, hue='Category')
plt.title('Sales vs Profit by Category')
plt.show()
(b) Average Profit by Region; There’s a positive correlation between sales and profit overall, but some high-sales orders show low or negative profits, likely due to discounts.
plt.figure(figsize=(7,5))
sns.barplot(x='Segment', y='Profit', data=df, estimator='mean')
plt.title('Average Profit by Segment')
plt.show()
(c) Sales by Region; West and East regions generate the most sales, indicating stronger customer bases or larger markets compared to Central and South.
plt.figure(figsize=(8,5))
sns.barplot(x='Region', y='Sales', data=df, estimator=sum)
plt.title('Total Sales by Region')
plt.show()
(d) Discount vs Profit; A negative relationship is observed — as discounts increase, profits tend to decrease, showing the impact of heavy discounting on profitability.
plt.figure(figsize=(8,5))
sns.scatterplot(x='Discount', y='Profit', data=df, alpha=0.6)
plt.title('Impact of Discount on Profit')
plt.show()
📊 3. Multiariate Analysis¶
(a) Sales, Profit, and Discount combined; High discounts often lead to lower profits even when sales increase, showing that aggressive discounting doesn’t always improve profitability
plt.figure(figsize=(10,6))
sns.scatterplot(x='Sales', y='Profit', hue='Discount', size='Quantity', data=df, alpha=0.7)
plt.title('Sales vs Profit colored by Discount and sized by Quantity')
plt.show()
(b) Heatmap – correlation between numeric variables; The heatmap reveals a strong positive correlation between Sales and Profit, while Discount has a negative correlation with Profit, confirming the discount-profit trade-off.
plt.figure(figsize=(8,5))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
(c) Sales trend over time; Sales show a steady upward trend, with seasonal spikes near the end of each year, indicating strong holiday or year-end performance.
monthly_sales = df.groupby('Order_MonthName')['Sales'].sum().reindex(
['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
)
plt.figure(figsize=(10,5))
monthly_sales.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Month')
plt.ylabel('Sales')
plt.show()
(d) Sales and Profit by Category and Region; The West region consistently records high sales and profit, while South and Central regions show lower margins, highlighting potential areas for growth.
plt.figure(figsize=(10,6))
sns.barplot(x='Category', y='Sales', hue='Region', data=df, estimator=sum)
plt.title('Sales by Category and Region')
plt.show()
(e) Interactive Plotly chart; The interactive Plotly visual enables users to explore sales and profit patterns dynamically across different categories, regions, and time periods for deeper insights.
fig = px.scatter(df, x='Sales', y='Profit', color='Category', size='Quantity',
hover_data=['Region', 'Sub-Category'])
fig.update_layout(title='Interactive Sales vs Profit Visualization')
fig.show()
Summary - Exploratory Data Analysis (EDA) & Visualizations;
The exploratory analysis revealed that most sales are small, with a few large orders driving overall revenue. Profits are uneven, often reduced by high discounts. Technology and Furniture are the top-selling categories, while Standard Class shipping is the most used. Regionally, the West leads in both sales and profit, whereas the South underperforms. Discounts show a clear negative impact on profitability, despite boosting sales volume. Sales trends peak toward year-end, indicating strong seasonal demand. Overall, Superstore’s growth depends on optimizing discounts, strengthening weaker regions, and focusing on high-performing product categories to sustain profitability and revenue growth.
🔍 Key Insights¶
Sales Concentration: Most transactions are low-value, but a few large orders contribute disproportionately to total revenue — showing a strong sales imbalance typical in retail data.
Profit Drivers: The Technology category and Consumer segment are the main profit contributors, while Furniture shows moderate profit despite high sales.
Discount Impact: Heavy discounting consistently leads to profit losses, indicating the need for better discount control and pricing strategies.
Regional Performance: The West region performs best in both sales and profit, while the South and Central regions show potential for strategic improvement.
Seasonal Trends: Sales and profits peak in November and December, reflecting strong holiday demand and year-end purchasing patterns.
✅ Conclusion¶
The project demonstrates strong analytical depth, clear visualization, and actionable insights. The findings can help Superstore optimize its pricing, regional strategies, and product focus to improve profitability and performance.